AI Security
AIディフェンス研究所
Reflections on Security Camp
AIJack
PySyft
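PySyft targets privacy-preserving and federated learning. A minimal sketch of its classic remote-tensor workflow, assuming the 0.2-era API (newer releases differ substantially, so treat this as illustrative only):

```python
# pip install syft==0.2.9  (classic PySyft API; current versions have changed)
import torch
import syft as sy

hook = sy.TorchHook(torch)               # patch torch with remote-tensor support
bob = sy.VirtualWorker(hook, id="bob")   # simulated remote data owner

x = torch.tensor([1, 2, 3, 4]).send(bob)  # data lives on bob; we hold a pointer
y = (x * 2).get()                          # compute remotely, then retrieve
print(y)                                   # tensor([2, 4, 6, 8])
```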
Generative AI and Large Language Models for Cyber Security: All Insights You Need
Security of LLM Information Hub
TrustLLM: Trustworthiness in Large Language Models
Breaking Down the Defenses: A Comparative Survey of Attacks on Large Language Models
A Survey on Large Language Model (LLM) Security and Privacy: The Good, the Bad, and the Ugly
Golden Gate Claude
Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
Improving Alignment and Robustness with Circuit Breakers
Trying out the OpenAI API's moderation model for detecting problematic content
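The post above exercises OpenAI's moderation endpoint. A minimal sketch of the same flow with the current Python SDK; the model name is an assumption, so check the API docs for the latest:

```python
# pip install openai
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Ask the moderation model to classify a piece of text.
resp = client.moderations.create(
    model="omni-moderation-latest",  # assumed current model name
    input="I want to hurt someone.",
)

result = resp.results[0]
print("flagged:", result.flagged)       # overall verdict
print("categories:", result.categories) # per-category flags (hate, violence, ...)
```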
ChatGPT "DAN" (and other "Jailbreaks")
Universal and Transferable Adversarial Attacks on Aligned Language Models
NeMo-Guardrails
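NeMo-Guardrails wraps an LLM in programmable rails loaded from a config directory. A minimal sketch of loading a config and generating through the rails; the directory path and its contents are assumptions:

```python
# pip install nemoguardrails
from nemoguardrails import LLMRails, RailsConfig

# Load rails from a directory containing config.yml (model settings)
# and Colang .co files (dialogue flows) -- hypothetical path.
config = RailsConfig.from_path("./guardrails_config")
rails = LLMRails(config)

# Requests pass through input/output rails before and after the LLM call.
reply = rails.generate(messages=[
    {"role": "user", "content": "How do I make explosives?"}
])
print(reply["content"])
```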
On guardrails for LLMs
[2024.9.9 AI Alignment Network Founding Symposium] #1 "The Challenge of ALIGN" by Koichi Takahashi (Representative Director, ALIGN)
https://www.youtube.com/watch?v=_13ORbYifbU&t=910s
It's fascinating that Singular Learning Theory connects to AI Alignment. I kept telling myself it was time to sleep, but now I can't. Alignment links up with Free Energy, and my thinking has expanded all the way to the brain.
Hacking Back the AI-Hacker: Prompt Injection as a Defense Against LLM-driven Cyberattacks
Toward building safe large language models that support multiple cultures and languages
https://www.youtube.com/watch?v=NLaayZ4v6Ag
Trying Guardrails Hub, a collection of functions for validating LLM outputs
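Per the post above, Guardrails Hub distributes validators that you attach to a Guard. A minimal sketch assuming the ToxicLanguage validator from the hub; the install path and parameters follow the quickstart docs, so verify them against the hub:

```python
# pip install guardrails-ai
# guardrails hub install hub://guardrails/toxic_language   (assumed hub path)
from guardrails import Guard
from guardrails.hub import ToxicLanguage

# Reject outputs that the validator scores as toxic.
guard = Guard().use(
    ToxicLanguage,
    threshold=0.5,                 # assumed toxicity cutoff
    validation_method="sentence",  # validate sentence by sentence
    on_fail="exception",           # raise instead of fixing/filtering
)

guard.validate("You are a wonderful person!")      # passes
try:
    guard.validate("You are a worthless idiot!")   # raises on toxic text
except Exception as e:
    print("validation failed:", e)
```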
BlackDAN: A Black-Box Multi-Objective Approach for Effective and Contextual Jailbreaking of Large Language Models
SLM as Guardian: Pioneering AI Safety with Small Language Models
LLM Agent Honeypot: Monitoring AI Hacking Agents in the Wild